Author: Xavier López
Date: December 2020
Objective: This notebook contains all the code that aims to wrangle data from different data sources and provide a clean, merged dataset ready for analysis and visualization.
The dataset that I will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs.
WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because *they're good dogs Brent*. WeRateDogs has over 4 million followers and has received international media coverage.
WeRateDogs downloaded their Twitter archive and shared it with Udacity students to use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017.

The goal of this project is to wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it only contains very basic tweet information. Additional gathering, then assessing and cleaning is required for "Wow!"-worthy analyses and visualizations.
We will be using data from three different sources:
1. Twitter archive file: It contains the core of our data; it is available in data/twitter_archive_enhanced.csv.
The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain though: each tweet's text, which is used to extract the rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." Of the 5000+ tweets, only those with ratings have been kept (2356 tweets).
The data extraction has been done programmatically, but the author didn't do a very good job. The ratings probably aren't all correct. The same goes for the dog names and probably the dog stages too (see below for more information on these). We'll need to assess and clean these columns before using them for analysis and visualization.
2. Additional data via the Twitter API: We will query Twitter's API to obtain the retweet count and favorite count for each tweet, which are not available in the Twitter archive file. This additional data can be gathered by anyone from Twitter's API. Well, "anyone" who has access to data for the 3000 most recent tweets, at least. But because we have the WeRateDogs Twitter archive, and specifically the tweet IDs within it, we can gather this data for all 5000+ tweets.
The setting up of the API is done in this notebook.
3. Twitter image predictions file: This data is available at data/image-predictions.tsv.
Every image in the WeRateDogs Twitter archive has been run through a neural network that can classify breeds of dogs. The results: a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4, since tweets can have up to four images).
The following code reads data from the twitter archive, queries the Twitter API for every id, generates a file collecting the full Twitter API data (tweet_json.txt) and reads it into dataframe format. Finally, the image predictions are also read.
The result is that data from different sources map to the following dataframes:
import pandas as pd
import numpy as np
import tweepy
import json
import sys
from datetime import date
import calendar
%matplotlib inline
import matplotlib.pyplot as plt
import altair as alt
from altair_saver import save
pd.options.display.max_colwidth = 2500
Code
#IMPORT CODE TO USE A PROGRESS BAR (used in generate_tweetdata_api)
from __future__ import print_function
import re
class ProgressBar(object):
    DEFAULT = 'Progress: %(bar)s %(percent)3d%%'
    FULL = '%(bar)s %(current)d/%(total)d (%(percent)3d%%) %(remaining)d to go'

    def __init__(self, total, width=40, fmt=DEFAULT, symbol='=',
                 output=sys.stderr):
        assert len(symbol) == 1
        self.total = total
        self.width = width
        self.symbol = symbol
        self.output = output
        self.fmt = re.sub(r'(?P<name>%\(.+?\))d',
                          r'\g<name>%dd' % len(str(total)), fmt)
        self.current = 0

    def __call__(self):
        percent = self.current / float(self.total)
        size = int(self.width * percent)
        remaining = self.total - self.current
        bar = '[' + self.symbol * size + ' ' * (self.width - size) + ']'
        args = {
            'total': self.total,
            'bar': bar,
            'current': self.current,
            'percent': percent * 100,
            'remaining': remaining
        }
        print('\r' + self.fmt % args, file=self.output, end='')

    def done(self):
        self.current = self.total
        self()
        print('', file=self.output)
def get_tokens(api_keys_jsonfile):
    with open(api_keys_jsonfile) as json_file:
        data = json.load(json_file)
    return data["consumer_key"], data["consumer_secret"], data["access_token"], data["access_token_secret"]
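For reference, get_tokens expects a JSON file holding the four Twitter credentials. The sketch below shows the expected structure; the filename and all values are hypothetical placeholders, not real credentials.

```python
import json

# Hypothetical example of the expected structure of the keys file
# (e.g. config/api_keys_tokens.txt); all values are placeholders.
example_keys = {
    "consumer_key": "YOUR_CONSUMER_KEY",
    "consumer_secret": "YOUR_CONSUMER_SECRET",
    "access_token": "YOUR_ACCESS_TOKEN",
    "access_token_secret": "YOUR_ACCESS_TOKEN_SECRET",
}

with open("example_api_keys.json", "w") as f:
    json.dump(example_keys, f)

# get_tokens would then return the four values in this order:
with open("example_api_keys.json") as f:
    data = json.load(f)
tokens = (data["consumer_key"], data["consumer_secret"],
          data["access_token"], data["access_token_secret"])
```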
def initialize_api(api_keys_jsonfile):
    consumer_key, consumer_secret, access_token, access_token_secret = get_tokens(api_keys_jsonfile)
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
    return api
def generate_tweetdata_api(api, tweet_ids):
    error_list = []
    total = len(tweet_ids)
    progress = ProgressBar(total, fmt=ProgressBar.FULL)
    with open('data/tweet_json.txt', 'w+') as outfile:
        for tweet_id in tweet_ids:
            try:
                progress.current += 1
                progress()
                tweet = api.get_status(tweet_id)
                # json.dump serializes the tweet; passing the file handle
                # writes it directly to the file
                json.dump(tweet._json, outfile)
                outfile.write('\n')
            except Exception:
                error_list.append(tweet_id)
    progress.done()
    print("'data/tweet_json.txt' has been generated")
def get_tweet_api_exported_file():
    DF_tweets = pd.DataFrame()
    with open('data/tweet_json.txt', encoding='utf8', mode='r') as json_file:
        # iterate through each line
        for line in json_file:
            try:
                # parse the current line into a dictionary
                # (not json_file.readline(), which would skip every other tweet)
                data = json.loads(line)
                df_tweet = pd.DataFrame(data)
                df_tweet = df_tweet[df_tweet.index == "name"].set_index("id").copy()
                DF_tweets = DF_tweets.append(df_tweet)
            except Exception:
                print("A line of 'data/tweet_json.txt' could not be successfully processed")
    return DF_tweets[["created_at", "id_str", "text", "truncated", "source", "user", "retweet_count", "favorite_count", "lang"]]
print("\nRead twitter archive file")
df_ta = pd.read_csv('data/twitter-archive-enhanced.csv')
print("\nGetting data from twitter api")
api = initialize_api('config/api_keys_tokens.txt')
generate_tweetdata_api(api, df_ta.tweet_id)
print("\nReading data from the exported file")
df_api = get_tweet_api_exported_file()
df_api
print("\nData from 'config/api_keys_tokens.txt' has been successfully read")
print("\nRead predictions data")
try:
    df_pred = pd.read_csv('data/image-predictions.tsv', delimiter="\t")
    print("\nPredictions data has been successfully read")
except Exception:
    print("\nSomething went wrong")
df_ta.head(5)
df_api.head(5)
df_pred.head(5)
df_ta.head(5).T
Recall that this dataset is the core of our data. First we will analyze which columns contain valuable information and drop the useless ones.
The columns:
- in_reply_to_status_id
- in_reply_to_user_id
- retweeted_status_id
- retweeted_status_user_id
- retweeted_status_timestamp
contain mostly nulls and no useful information, so we will drop them.
Test
len(df_ta["retweeted_status_id"].unique()) #get number of distinct values
df_ta["retweeted_status_id"].isnull().sum() #get numbers of nulls in column
df_ta["retweeted_status_id"].unique()[:10] #get first 10 distincts elements to see how the non na data is
Those columns are not actually completely useless: when they are not null, it is because the tweet is actually a retweet.
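As a quick illustration (on a toy frame, not the real archive), the null mask on retweeted_status_id is what separates original tweets from retweets:

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the relevant column: NaN = original tweet,
# a value = this row is a retweet of that status id.
toy = pd.DataFrame({"retweeted_status_id": [np.nan, 8.1e17, np.nan]})

# Keeping only the null rows keeps only the original tweets
originals = toy[toy.retweeted_status_id.isnull()]
print(len(originals))  # 2 original tweets out of 3 rows
```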
A different issue is that the information of one feature (dog stage) is stored across multiple columns (doggo, floofer, pupper and puppo), this information should be encoded in a single column.
See below information on dog stages

The Dogtionary explains the various stages of dog: doggo, pupper, puppo, and floof(er) (via the #WeRateDogs book on Amazon)
df_ta[["doggo", "floofer", "pupper", "puppo"]]
Test
df_ta[["rating_numerator","rating_denominator"]]
Test
set(df_ta.source)
There are 4 different values for source, but those names are too long, so we will map them to 4 easier-to-read categories:
- web client
- iphone
- vine
- tweet deck
Test
set(df_ta.name)
The expanded_urls column gives more than one value for tweets with more than one image, and those values are repeated, separated by commas.
I will take a simplified approach where we get a single url for each tweet (the first element of the list, i.e. the first photo).
Test
df_ta.expanded_urls[10].split(",")
df_ta.expanded_urls[4].split(",")
df_ta.timestamp[0].split(" ")
set([x.split(" ")[1] for x in df_ta.timestamp])
Code:
df_ta_dropretweets = df_ta.copy()
df_ta_dropretweets = df_ta_dropretweets[df_ta_dropretweets.retweeted_status_id.isnull()]
df_ta_dropretweets = df_ta_dropretweets[df_ta_dropretweets.retweeted_status_user_id.isnull()]
df_ta_dropretweets = df_ta_dropretweets[df_ta_dropretweets.retweeted_status_timestamp.isnull()]
Test
len(df_ta_dropretweets["retweeted_status_id"].unique()) #get number of distinct values
len(df_ta_dropretweets["retweeted_status_user_id"].unique()) #get number of distinct values
len(df_ta_dropretweets["retweeted_status_timestamp"].unique()) #get number of distinct values
Code
df_ta_usefullcols = df_ta.copy()
drop_cols = ["in_reply_to_status_id", "in_reply_to_user_id", "retweeted_status_id", "retweeted_status_user_id", "retweeted_status_timestamp"]
df_ta_usefullcols = df_ta_usefullcols.drop(drop_cols, axis=1)
df_ta_usefullcols.T
Get the dog_stage
Recall that dog stage is stored across multiple columns (doggo, floofer, pupper and puppo), this information should be encoded in a single column.
Those fields are one-hot encoded; we should reverse this encoding.
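Before applying it to the real columns, the reversal can be sketched on a toy frame (same column names and "None" placeholder as the archive):

```python
import pandas as pd

# Toy one-hot-style frame with the archive's "None" placeholder
toy = pd.DataFrame({
    "doggo":   ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper":  ["None", "pupper", "None"],
    "puppo":   ["None", "None", "None"],
})

# Replace the placeholder with empty strings and concatenate row-wise:
# rows with exactly one stage yield that stage, rows with none yield ""
# (which is later relabeled "Unknown").
dog_stage = (toy.replace("None", "")
                .apply(lambda row: "".join(row.astype(str)), axis=1))
print(list(dog_stage))  # ['doggo', 'pupper', '']
```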
Test
set(df_ta.doggo)
set(df_ta.floofer)
set(df_ta.pupper)
set(df_ta.puppo)
Code
dog_stage = df_ta_usefullcols[["doggo","floofer","pupper","puppo"]].replace("None","").apply(lambda x: ''.join(x.astype(str)),axis=1)
df_ta_dstage = df_ta_usefullcols.copy()
df_ta_dstage["dog_stage"] = dog_stage
Test
set(dog_stage)
There is an unexpected issue: some entries are categorized with more than one dog_stage by the data source. This is due to the fact that more than one dog stage appears in the text:
Test
df_ta_dstage[df_ta_dstage["dog_stage"] =="doggofloofer"]
df_ta_dstage[df_ta_dstage["dog_stage"] =="doggopupper"]
df_ta_dstage[df_ta_dstage["dog_stage"] =="doggopuppo"]
Since there are only 14 misclassifications (double classifications) in dog_stage, we can fix them manually by looking at the text.
Test
df_ta_dstage[df_ta_dstage["dog_stage"] =="doggofloofer"]
df_ta_dstage.loc[df_ta_dstage.index == 200,'text'].iloc[0]
df_ta_dstage.loc[df_ta_dstage.index == 200,'dog_stage'] = "floofer"
Code:
df_ta_dstage[df_ta_dstage["dog_stage"] =="doggopuppo"]
df_ta_dstage.loc[df_ta_dstage.index == 191,'text'].iloc[0]
df_ta_dstage.loc[df_ta_dstage.index == 191,'dog_stage'] = "puppo"
Code:
df_ta_dstage[df_ta_dstage["dog_stage"] =="doggopupper"]
After reading the texts we realize that there are pictures that contain a pupper and a doggo at the same time:
Test:
df_ta_dstage[df_ta_dstage.index ==1113].text
We were not expecting images with more than one dog stage, but we have them; those images will be relabeled as pupper_doggo.
However, there are other images that needed relabeling since they were referencing only one dog.
The changes are applied below, where the list elements refer to the index of the df_ta_dstage dataframe:
Code:
# relabel pupper
for index in [460, 575, 733]:
    df_ta_dstage.loc[df_ta_dstage.index == index, 'dog_stage'] = "pupper"
# relabel doggo
for index in [705, 956]:
    df_ta_dstage.loc[df_ta_dstage.index == index, 'dog_stage'] = "doggo"
# relabel images containing both a pupper and a doggo
for index in [531, 565, 778, 822, 889, 1063, 1113]:
    df_ta_dstage.loc[df_ta_dstage.index == index, 'dog_stage'] = "pupper_doggo"
df_ta_dstage.dog_stage = df_ta_dstage.dog_stage.replace("","Unknown")
df_ta_dstage = df_ta_dstage.drop(columns = ["doggo", "floofer", "pupper", "puppo"], axis = 1)
df_ta_dstage.T
Code:
df_ta_rating = df_ta_dstage.copy()
df_ta_rating["rating"] = 100*df_ta_rating.rating_numerator/df_ta_rating.rating_denominator
df_ta_rating = df_ta_rating.drop(["rating_numerator", "rating_denominator"], axis = 1)
df_ta_rating.T
Code:
df_ta_source = df_ta_rating.copy()
set(df_ta_source.source)
dict_source = dict()
dict_source['<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>'] = "web_client"
dict_source['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'] = "iphone"
dict_source['<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>'] = 'vine'
dict_source['<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>'] = 'tweet deck'
df_ta_source.source = [dict_source[x] for x in df_ta_source.source]
df_ta_source.T
Code:
def map_lowercase_unknown(word):
    # Words starting with a lowercase letter are not real dog names
    # (they come from faulty text extraction), so map them to "Unknown"
    if word[0].isupper():
        return word
    return "Unknown"
df_ta_names = df_ta_source.copy()
df_ta_names["names"] = [map_lowercase_unknown(x) for x in df_ta_names.name]
set(df_ta_names.names)
df_ta_names.T
The expanded_urls column gives more than one value for tweets with more than one image, and those values are repeated, separated by commas.
I will take a simplified approach where we get a single url for each tweet (the first element of the list, i.e. the first photo).
Code:
df_ta_urls = df_ta_names.copy()
df_ta_urls.expanded_urls = [str(x).split(",")[0] for x in df_ta_names.expanded_urls]
df_ta_urls.T
Code:
df_ta_time = df_ta_urls.copy()
df_ta_time.timestamp
df_ta_time["date"] = [x.split(" ")[0] for x in df_ta_time.timestamp]
df_ta_time["time"] = [x.split(" ")[1] for x in df_ta_time.timestamp]
df_ta_time["hour"] = [int(x[:2]) for x in df_ta_time.time]
df_ta_time["day"] = [int(x[8:10]) for x in df_ta_time.date]
df_ta_time["month"] = [int(x[5:7]) for x in df_ta_time.date]
df_ta_time["year"] = [int(x[:4]) for x in df_ta_time.date]
df_ta_time["calmonth"] = df_ta_time['month'].map(str) + '-' + df_ta_time['year'].map(str)
def get_day_name(row):
    return calendar.day_name[date(row["year"], row["month"], row["day"]).weekday()]
df_ta_time["day_of_week"] = df_ta_time.apply(get_day_name, axis = 1)
df_ta_time.T
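For instance, get_day_name's logic applied to a hypothetical row dict (the archive's cut-off date):

```python
import calendar
from datetime import date

# Hypothetical row with the year/month/day columns built above
row = {"year": 2017, "month": 8, "day": 1}

# Same lookup get_day_name performs per row
day_name = calendar.day_name[date(row["year"], row["month"], row["day"]).weekday()]
print(day_name)  # August 1, 2017 was a Tuesday
```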
The cleaning process for the twitter archive file is done.
df_ta_fullclean = df_ta_time.copy()
Test:
df_api.head(5).T
Through the API we have collected more data than we actually need; some of it duplicates information we already have from the twitter archive.
The columns id_str, text and source are redundant and hence we should drop them.
The columns lang and user are categorical with a single unique value in the data, so they are not informative and we will also drop them.
Finally, truncated offers new information but is not really relevant, so it will also be dropped.
In conclusion, we will keep the columns id, retweet_count and favorite_count.
Those columns do not contain NaNs.
Test:
set(df_api.lang)
set(df_api.user)
set(df_api.favorite_count.isnull())
set(df_api.retweet_count.isnull())
Code:
df_api_fullclean = df_api.reset_index()[["id", "retweet_count", "favorite_count"]].copy()
df_api_fullclean
Test:
df_pred.head(5)
Recall that this dataset contains a tweet_id and the results of a classification model that predicts breeds of dogs.
Our goal in this step is to get a table with a unique mapping between tweet_id and a dog breed. The neural network can predict categories other than dog breeds, which is why there is a column informing whether a predicted category is or is not a dog.
The model has given three outputs with their corresponding prediction confidence. Mapping a tweet_id to a dog breed is not as trivial as taking the most confident predicted category, because that category is not necessarily a dog.
Non-dog predictions can happen in different situations:
The image was not framing a dog.
Example: the tweet_id 666051853826850816 has as most confident prediction box_turtle, which is accurate and not a dog.

The image was framing a dog but the neural network was not able to predict the breed of the dog and focused on something else.
Example: the tweet_id 666268910803644416 has as predictions desktop_computer, desk and bookcase (none of which are dog breeds).

The dog breed was predicted correctly but not as a first choice.
Example: the tweet_id 666057090499244032 has as most confident prediction shopping_cart with 96.2% confidence, then shopping_basket with 1.4% confidence and finally golden_retriever with 0.8% confidence.

df_pred[df_pred.p1_dog == False].head(5)
Note also that we are not using the img_num column at all; some tweets use more than one image but the predictive model has been run on only one of these, so this information is not valuable for further analysis.
Test:
set(df_pred.img_num)
len(set(df_pred.tweet_id))==df_pred.shape[0] #check if the number of rows is the same of the number of unique ids
The goal is to obtain the dog breed information for each tweet_id; this information is stored across multiple columns.
The implementation: if the most confident category is a dog breed, take the first predicted category; else, if the second category is a dog breed, take the second instead; else, if the third is a dog breed, take the third; and if none of them is a dog breed, set the predicted dog breed to Unknown:
Code:
def get_dog_predicted_category(row):
    if row["p1_dog"]:
        return row["p1"]
    elif row["p2_dog"]:
        return row["p2"]
    elif row["p3_dog"]:
        return row["p3"]
    else:
        return "Unknown"
df_pred.apply(get_dog_predicted_category, axis = 1)
df_pred_clean = pd.DataFrame()
df_pred_clean["tweet_id"] = df_pred["tweet_id"]
df_pred_clean["jpg_url"] = df_pred["jpg_url"]
df_pred_clean["dog_breed"] = df_pred.apply(get_dog_predicted_category, axis = 1)
df_pred_clean.set_index("tweet_id", inplace = True)
df_pred_clean
Before merging the dog_breed conclusion for each tweet_id with the rest of the data from the other sources, it would be interesting to analyze how many tweets we have not been able to classify as a dog breed.
Test:
df_pred_clean.groupby(['dog_breed']).count().rename(columns={"jpg_url": "counts"}).sort_values(by = "counts", ascending = False)
df_gbreeds = df_pred_clean.groupby(['dog_breed']).count().rename(columns={"jpg_url": "counts"}).sort_values(by = "counts", ascending = False)
df_gbreeds["count_percentage"] = 100* df_gbreeds.counts / sum(df_gbreeds.counts)
df_gbreeds
Only 15% of the predictions cannot be classified as dog breeds, i.e. 85% of the predictions have been successfully mapped to a dog breed.
15% of the entries is small enough to discard and leave out of further analysis.
Unfortunately we cannot easily assess whether a prediction is accurate, and we do not have the validation/test accuracies of the predictive model, so for simplicity we will assume from now on that the most confident dog breed is the ground truth class.
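Should we later want to drop the unclassified entries, a minimal sketch on a toy version of df_pred_clean (the real one has ~2000 rows):

```python
import pandas as pd

# Toy version of df_pred_clean: breed per tweet, "Unknown" = unclassified
toy = pd.DataFrame({
    "dog_breed": ["golden_retriever", "Unknown", "pug"],
    "jpg_url": ["u1", "u2", "u3"],
})

# Keep only the rows mapped to an actual dog breed
with_breed = toy[toy.dog_breed != "Unknown"]
print(len(with_breed))  # 2 classified rows out of 3
```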
Finally we are done cleaning the predictions file:
Code:
df_pred_fullclean = df_pred_clean.copy()
Next we will merge the three following dataframes into a single one:
- df_ta_fullclean
- df_api_fullclean
- df_pred_fullclean
Test:
df_ta_fullclean.head(5).T
df_api_fullclean.head(5).T
df_pred_fullclean.head(5).T
Code:
df_merge1 = df_ta_fullclean.merge(df_api_fullclean,how = "left", left_on = "tweet_id", right_on = "id" ).drop(columns = "id")
df_fullmerge = df_merge1.merge(df_pred_fullclean, how = "left", left_on = "tweet_id", right_index = True)
df_fullmerge.T
Code:
df_fullmerge.to_csv("data/twitter_archive_master.csv", index=False)
Code:
#read data
df_fullmerge = pd.read_csv("data/twitter_archive_master.csv")
df_fullmerge.T
They seem to be directly (linearly) correlated, i.e. tweets with a lot of retweets also have a lot of favorites.
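This claim can be quantified with a Pearson correlation coefficient. A sketch on hypothetical counts (on the real data, the same call would be df_fullmerge['retweet_count'].corr(df_fullmerge['favorite_count'])):

```python
import pandas as pd

# Hypothetical retweet/favorite counts with a roughly linear relation
toy = pd.DataFrame({
    "retweet_count":  [100, 500, 2000, 8000],
    "favorite_count": [300, 1600, 6500, 26000],
})

# Series.corr computes the Pearson coefficient by default;
# values near 1 indicate a strong positive linear relation
r = toy["retweet_count"].corr(toy["favorite_count"])
print(round(r, 3))  # close to 1
```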
!pip install imgkit
import altair_saver
source = df_fullmerge
chart = alt.Chart(source).mark_circle(size=100).encode(
    x='retweet_count',
    y='favorite_count',
    color='dog_stage',
    tooltip=['tweet_id', 'dog_stage', 'retweet_count', 'favorite_count']
).interactive().properties(
    width=650,
    height=400
)
chart.save('img/scatter.html')
chart
December 2015 was the month with the most tweets published, concentrated between 1:00 and 5:00.
heat_count = alt.Chart(df_fullmerge).mark_rect().encode(
    alt.X('hours(timestamp):O', title='hour of day'),
    alt.Y('yearmonth(timestamp):O', title='date'),
    alt.Color('count(tweet_id):Q', title='Count of tweets'),
    alt.Tooltip(['yearmonth(timestamp)', 'hours(timestamp)', 'count(tweet_id):Q'])
)
heat_count.save('img/calendar_counts.html')
heat_count
It is interesting to notice that the amount of accumulated retweets is not linked to the amount of tweets; the most viral tweets were published in Jan 2017 at 3:00 and Jan 2016 at 20:00.
heat_retweet = alt.Chart(df_fullmerge).mark_rect().encode(
    alt.X('hours(timestamp):O', title='hour of day'),
    alt.Y('yearmonth(timestamp):O', title='date'),
    alt.Color('sum(retweet_count):Q', title='Count of content retweets'),
    alt.Tooltip(['yearmonth(timestamp)', 'hours(timestamp)', 'sum(retweet_count)'])
)
heat_retweet.save('img/calendar_retweet.html')
heat_retweet
Virality does not seem to be related to the dog breed; we can find outliers of any breed that went viral. For example, labrador retriever (Jun 2016, 75153 retweets) outperformed by far all the other months for the same dog_breed.
Furthermore, there are less frequently published dog breeds like Eskimo dog (Jun 2016, 55956 retweets) or standard poodle (Jan 2017, 36372 retweets) that are rarely published but perform really well when they go viral.
import altair as alt

source = df_fullmerge
chart = alt.Chart(source).mark_circle(
    opacity=0.8,
    stroke='black',
    strokeWidth=1
).encode(
    alt.X('yearmonth(timestamp):O', axis=alt.Axis(labelAngle=0)),
    alt.Y('dog_breed:N'),
    alt.Size('retweet_count:Q',
             scale=alt.Scale(range=[0, 4000]),
             legend=alt.Legend(title='Retweet count')),
    alt.Tooltip(['dog_breed', 'yearmonth(timestamp)', 'names', 'tweet_id:N', 'retweet_count']),
    alt.Color('dog_breed:N', legend=None)
).properties(
    width=500,
    height=1000
)
chart.save('img/bubles.html')
chart
Most tweets do not go viral at all; the tweets that do are few, but they pull really far away from the bunch.
#df_fullmerge[["retweet_count","favorite_count"]].hist(bins = 50, figsize = [15,4])
x = df_fullmerge["retweet_count"]
num_bins = 50
fig, ax = plt.subplots()
fig.set_size_inches(12,5)
# the histogram of the data
n, bins, patches = ax.hist(x, num_bins, density=0)
ax.set_xlabel('Retweet Count')
ax.set_ylabel('Count Tweets')
ax.set_title('Histogram of Retweet Count')
plt.savefig('img/hist_retweets.png')
plt.show()
The top three informed dog breeds in our dataset are shown in the chart below, which ranks breeds by total retweet count.
chart = alt.Chart(df_fullmerge).mark_bar().encode(
    y='sum(retweet_count):Q',
    tooltip=['dog_breed', 'sum(retweet_count)', 'count(tweet_id)'],
    x=alt.X('dog_breed:N', sort='-y')
)
chart.save('img/top_dog_breeds.html')
chart